Systems and methods for executing a programmable finite state machine that accelerates fetchless computations and operations of an array of processing cores of an integrated circuit

ABSTRACT

Systems and methods for fetchless acceleration of convolutional loops on an integrated circuit include identifying, by a compiler, finite state machine (FSM) initialization parameters based on computational requirements of a computational loop; initializing a programmable FSM based on the FSM initialization parameters, wherein the FSM initialization parameters include a loop iteration parameter identifying a number of computation cycles of the computational loop; executing the programmable FSM to enable fetchless computations by: generating a plurality of computational loop control signals including a distinct computation loop control signal for each of the number of computation cycles of the computational loop based on the loop iteration parameter; and controlling an execution of a plurality of computation cycles of a computational circuit performing the computational loop based on transmitting the plurality of computational loop control signals until the number of computation cycles of the computation loop are completed.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/953,312, filed 26 Sep. 2022, which claims the benefit of U.S. Provisional Application No. 63/235,775, filed 22 Aug. 2021, and U.S. Provisional Application No. 63/407,258, filed 16 Sep. 2022, which are incorporated herein in their entireties by this reference.

TECHNICAL FIELD

The one or more inventions described herein relate generally to the integrated circuitry field, and more specifically to a new and useful perception and dense algorithm processing integrated circuitry architecture in the integrated circuitry field.

BACKGROUND

Modern applications of artificial intelligence and, more generally, machine learning appear to be driving innovations in robotics and, specifically, in technologies involving autonomous robotics and autonomous vehicles. Also, developments in machine perception technology have enabled many implementations in the autonomous robotics and autonomous vehicle spaces to perceive vision, perceive hearing, and perceive touch, among many other capabilities that allow machines to comprehend their environments.

The underlying perception technologies applied to these autonomous implementations include a number of advanced and capable sensors that often allow for a rich capture of environments surrounding the autonomous robots and/or autonomous vehicles. However, while many of these advanced and capable sensors may enable a robust capture of the physical environments of many autonomous implementations, the underlying processing circuitry that may function to process the various sensor signal data from the sensors often lacks correspondingly robust processing capabilities sufficient to allow for high performance and real-time computing of the sensor signal data.

The underlying processing circuitry often includes general purpose integrated circuits including central processing units (CPUs) and graphics processing units (GPUs). In many applications, GPUs are implemented rather than CPUs because GPUs are capable of executing bulky or large amounts of computations relative to CPUs. However, the architectures of most GPUs are not optimized for handling many of the complex machine learning algorithms (e.g., neural network algorithms, etc.) used in machine perception technology. For instance, the autonomous vehicle space includes multiple perception processing needs that extend beyond merely recognizing vehicles and persons. Autonomous vehicles have been implemented with advanced sensor suites that provide a fusion of sensor data that enables route or path planning for autonomous vehicles. But modern GPUs are not constructed for handling these additional high computation tasks.

At best, to enable a GPU or similar processing circuitry to handle additional sensor processing needs including path planning, sensor fusion, and the like, additional and/or disparate circuitry may be assembled to a traditional GPU. This fragmented and piecemeal approach to handling the additional perception processing needs of robotics and autonomous machines results in a number of inefficiencies in performing computations including inefficiencies in sensor signal processing.

Accordingly, there is a need in the integrated circuitry field for an advanced integrated circuit and processing techniques that are capable of high performance and real-time processing and computing of routine and advanced sensor signals for enabling perception of robotics or any type or kind of perceptual machine.

The inventors of the inventions described in the present application have designed an integrated circuit architecture and one or more processing techniques that allow for enhanced sensor data processing capabilities and have further discovered related methods for implementing the integrated circuit architecture for several purposes including for enabling perception of robotics and various machines.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates a schematic of a system 100 in accordance with one or more embodiments of the present application;

FIG. 2 illustrates an example method 200 in accordance with one or more embodiments of the present application;

FIGS. 3A-3B illustrate example mixed block and flow diagrams for instructions generation in accordance with one or more embodiments of the present application;

FIG. 4 illustrates an example mixed block and flow schematic for fetchless computations in accordance with one or more embodiments of the present application;

FIG. 5 illustrates an example mixed block and flow schematic for fetchless computations and data rotations in accordance with one or more embodiments of the present application; and

FIG. 6 illustrates example mixed block and flow diagrams for data movements and data rotations in accordance with one or more embodiments of the present application.

BRIEF SUMMARY OF THE INVENTION(S)

In one embodiment, a method for fetchless acceleration of convolutional loops on an integrated circuit includes identifying, by a compiler, finite state machine (FSM) initialization parameters based on convolution requirements of one or more convolutional loops within a neural network graph; initializing a programmable FSM based on the FSM initialization parameters, wherein the FSM initialization parameters include at least a loop iteration parameter comprising a required number of computation cycles of a convolutional loop; at runtime, implementing the programmable FSM to enable fetchless computations by: (i) generating a plurality of convolutional loop control signals based on the FSM initialization parameters; and (ii) transmitting the plurality of convolutional loop control signals to one or more matrix accumulator circuits (MACs) of a plurality of distinct processing cores; and controlling, by the programmable FSM, an execution of a plurality of computation cycles of the one or more MACs performing a convolutional loop until a number of computation cycles of the convolutional loop is completed.

In one embodiment, initializing the programmable FSM based on the FSM initialization parameters includes: (i) programming a starting memory address parameter at a start memory address register file of the programmable FSM; (ii) programming a convolution filter size parameter at a convolution register file of the programmable FSM; and (iii) programming iteration parameters at an iteration register file of the programmable FSM.

In one embodiment, the programmable FSM is in direct command signal communication with each of the plurality of distinct processing cores of an array of processing cores.

In one embodiment, identifying, by the compiler, the FSM initialization parameters includes: computing a memory start address parameter identifying a memory address location within a local memory of each of the plurality of distinct processing cores.

In one embodiment, a method for implementing FSM-controlled convolutional computations on an integrated circuit includes identifying FSM programming instructions based on a neural network graph; configuring a programmable FSM based on the FSM programming instructions, wherein the programmable FSM controls: (a) operations of multiply accumulators of a plurality of distinct processing cores, and (b) operations of data ports of the plurality of distinct processing cores; and wherein configuring the programmable FSM includes: (1) initializing an address register file of the FSM with a starting memory address value, (2) initializing a convolutional register file of the FSM with a convolutional filter size value, and (3) initializing at least one iteration register file of the programmable FSM with an iteration value identifying a number of cycles of a convolutional loop performed by the multiply accumulators; and, based on the initialization of the programmable FSM, starting the programmable FSM, causing the programmable FSM to generate control signals to the plurality of distinct processing cores based on the programming of at least the at least one iteration register file.

In one embodiment, a method for implementing fetchless acceleration of computational loops on an integrated circuit includes identifying, by a compiler, finite state machine (FSM) initialization parameters based on computational requirements of a computational loop within a neural network graph; initializing a programmable FSM based on the FSM initialization parameters, wherein the FSM initialization parameters include a loop iteration parameter including a number of computation cycles of the computational loop; at runtime, implementing the programmable FSM to enable fetchless computations by: (i) generating, by the programmable FSM, a plurality of computational loop control signals including a distinct computation loop control signal for each of the number of computation cycles of the computational loop based on the loop iteration parameter; and (ii) controlling, by the programmable FSM, an execution of a plurality of computation cycles of a computational circuit performing the computational loop based on transmitting the plurality of computational loop control signals until the number of computation cycles of the computation loop are completed.
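The initialize-then-run behavior recited in the embodiments above can be illustrated with a minimal software sketch. It is not the claimed circuit: the class name `ProgrammableFSM`, the `ControlSignal` layout, and the one-address-per-cycle stride are assumptions made only for illustration. The sketch programs three registers (start address, filter size, iteration count) and then emits one distinct control signal per computation cycle, with no per-cycle instruction fetch.

```python
from dataclasses import dataclass
from typing import Iterator, NamedTuple


class ControlSignal(NamedTuple):
    """One control signal per computation cycle (hypothetical layout)."""
    cycle: int
    address: int
    filter_size: int


@dataclass
class ProgrammableFSM:
    # Stand-ins for the three register files named in the text;
    # the field names are illustrative assumptions.
    start_address: int = 0
    filter_size: int = 1
    iterations: int = 0

    def initialize(self, start_address: int, filter_size: int, iterations: int) -> None:
        """Program the registers with compiler-identified parameters."""
        self.start_address = start_address
        self.filter_size = filter_size
        self.iterations = iterations

    def run(self) -> Iterator[ControlSignal]:
        """Emit a distinct control signal for each cycle until the loop completes."""
        for cycle in range(self.iterations):
            # Assumed stride of one address per cycle; a real stride may differ.
            yield ControlSignal(cycle, self.start_address + cycle, self.filter_size)


fsm = ProgrammableFSM()
fsm.initialize(start_address=16, filter_size=3, iterations=4)
signals = list(fsm.run())
```

Once `initialize` has been called, `run` needs nothing further from the caller, which is the sense in which the loop proceeds without fetching instructions on each cycle.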

In one embodiment, the FSM initialization parameters further include a loop iteration and data movement parameter including (a) a distinct number of computation cycles of the computation loop and (b) at least one data movement instruction that, when executed, moves input data from a first register file of a processing core to a second register file of the processing core.

In one embodiment, the first register file is associated with a first data port of the processing core and the second register file is associated with a second data port of the processing core; and the data movement instruction, when executed, causes the input data to rotate by an angle from the first data port to the second data port.

In one embodiment, implementing the programmable FSM includes: generating, by the programmable FSM, a data movement control signal for each distinct number of computation cycles of the computation loop based on the loop iteration and data movement parameter.

In one embodiment, controlling the execution of the plurality of computation cycles of the computational circuit includes transmitting, by the programmable FSM, the data movement control signal for each distinct number of computation cycles until the distinct number of computation cycles of the computation loop are completed.

In one embodiment, initializing the programmable FSM based on the FSM initialization parameters includes encoding a starting memory address parameter to a start memory address register file accessible to one or more computational circuits controllable by the programmable FSM.

In one embodiment, the starting memory address parameter includes a register file pointer that points to a head of the input data at a location within an n-dimensional memory stored within at least one processing core controllable by the programmable FSM.

In one embodiment, initializing the programmable FSM based on the FSM initialization parameters includes encoding a convolution filter size parameter to a convolution register file of at least one processing core controllable by the programmable FSM.

In one embodiment, the convolution filter size parameter includes a value that maps to one of a plurality of distinct convolutional filter sizes for a given convolutional computation by a multiply accumulator circuit of the at least one processing core.

In one embodiment, initializing the programmable FSM based on the FSM initialization parameters includes encoding the loop iteration parameter to a combination of distinct iteration register files of at least one processing core controllable by the programmable FSM.

In one embodiment, at runtime, the programmable FSM executes the computational loop based on the loop iteration parameter, and subsequently, the programmable FSM executes one or more computational loops based on the loop iteration and data movement parameter.

In one embodiment, at runtime, the programmable FSM produces: a first set of control signals of the plurality of computational loop control signals for executing the computational loop based on the loop iteration parameter; and in response to completing the computational loop based on the loop iteration parameter, a second set of control signals of the plurality of computational loop control signals for executing (a) the distinct number of computation cycles of the computation loop and (b) the at least one data movement instruction based on the loop iteration and data movement parameter.

In one embodiment, at runtime, the programmable FSM produces the plurality of control signals causing an execution of an N-way multiply accumulate with computation weights and computation input data, wherein N relates to a number of distinct multiply accumulate circuits concurrently executing a distinct computational loop, and N is greater than one.

In one embodiment, if the convolutional filter size parameter of the FSM initialization parameters includes a value that maps to one of a plurality of distinct convolutional filter sizes that is greater than a 1×1 convolutional filter size, the programmable FSM broadcasts input data pointed to by the starting memory address parameter to a collection of processing cores in neighboring proximity.

In one embodiment, a method for implementing fetchless acceleration of convolutional loops on an integrated circuit includes identifying, by a compiler, finite state machine (FSM) initialization parameters based on computational requirements of a convolutional loop within a neural network graph; initializing a programmable FSM based on the FSM initialization parameters, wherein the FSM initialization parameters include a loop iteration parameter including a number of computation cycles of the convolutional loop; at runtime, implementing the programmable FSM to enable fetchless computations by: (i) generating, by the programmable FSM, a plurality of convolutional loop control signals based on the loop iteration parameter; and (ii) controlling, by the programmable FSM, an execution of a plurality of computation cycles of a multiply accumulator circuit (MAC) performing the convolutional loop based on transmitting the plurality of convolutional loop control signals until the number of computation cycles of the computation loop are completed.

In one embodiment, initializing the programmable FSM based on the FSM initialization parameters includes: (i) programming a starting memory address parameter at a start memory address register file accessible to the MAC controllable by the programmable FSM; (ii) programming a convolution filter size parameter at a convolution register file accessible to the MAC controllable by the programmable FSM; and (iii) programming iteration parameters at one or more iteration register files accessible to the programmable FSM.

In one embodiment, the programmable FSM is in direct command signal communication with a plurality of distinct MACs operating on each of a plurality of distinct processing cores.

In one embodiment, identifying, by the compiler, the FSM initialization parameters includes: computing a memory start address parameter including a memory address location within a local memory of each of the plurality of distinct processing cores.

In one embodiment, a method for implementing FSM-controlled convolutional computations on an integrated circuit includes identifying FSM programming instructions based on a neural network graph; configuring a programmable FSM based on the FSM programming instructions, wherein the programmable FSM controls: (a) operations of multiply accumulators (MACs) of a plurality of distinct processing cores, and (b) data movement operations of data ports of the plurality of distinct processing cores; and wherein configuring the programmable FSM includes: (1) programming a starting memory address value to an address register file accessible to MACs controllable by the programmable FSM, (2) programming a convolutional filter size to a convolutional register file associated with the FSM, and (3) programming at least one iteration register file associated with the programmable FSM with an iteration value identifying a number of cycles of a convolutional loop performed by at least one of the MACs; and executing a Boolean switch based on the initialization of the programmable FSM that starts an operation of the programmable FSM for generating control signals to the MACs for automatically executing distinct convolutional loops.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following description of preferred embodiments of the present application is not intended to limit the inventions to these preferred embodiments, but rather to enable any person skilled in the art to make and use these inventions.

Overview

In one or more embodiments of the present application, the systems and techniques described herein may allow for enhanced scheduling and execution of data transfers and computations, in parallel, that reduce latency in the one or more processes of an integrated circuit. In such embodiments, a scheduling of a plurality of memory transfers of inputs and outputs of computations of a computations network graph may be made in such a manner that enables overlaps with computations.

In the one or more embodiments, the methods and systems may function to configure an on-chip memory (OCM) or data buffer that interfaces with array processing cores of the integrated circuit. In embodiments such as these, the inputs for compute are preferably stored in the OCM and are sourced to the array processing cores, and the outputs of the computations are preferably transferred from the array processing cores to and stored by the OCM. In some circumstances, the content of the OCM may function to dictate the amount of compute that can be performed by the array processing cores. Because of this, the one or more embodiments of the present application provide systems and techniques that configure the OCM to optimize for both memory transfers and computations. That is, rather than sequential memory transfers to the OCM and subsequent computations based on the memory content of the OCM, the memory transfers are optimized for multiple parallel transfers into the OCM from a main memory based on the computation requirements of the array processing cores and the computations of the array processing cores may be accelerated based on partial dependency encodings of the OCM that allow computations to be performed by the OCM with only partial inputs stored in the OCM.

At least a few technical benefits of the above-noted embodiments of the present application include the continuous and uninterrupted computations of the array processing cores based on the encoded partial dependencies of the OCM and the continuous and uninterrupted memory transfers of inputs and outputs without the need to wait on the completion of the one or more computations at the array processing cores.

It shall also be recognized that the one or more embodiments of the present application may be implemented in any suitable processing environment including, but not limited to, within one or more IMDs and/or any suitable processing circuit.

The mesh architecture defined by the plurality of processing elements in the array core preferably enables in-memory computing and data movement, as described in U.S. Pat. No. 10,365,860 and U.S. patent application Ser. No. 16/292,537, which are incorporated herein in their entireties by this reference, and further enables a core-level predication and a tile-level predication.

1. A System Architecture of a Dense Algorithm and/or Perception Processing Circuit (Unit)

As shown in FIG. 1, the integrated circuit 100 (dense algorithm and/or perception processing unit) for performing perception processing includes a plurality of array cores 110, a plurality of border cores 120, a dispatcher (main controller) 130, a first plurality of periphery controllers 140, a second plurality of periphery controllers 150, and main memory 160. The integrated circuit 100 may additionally include a first periphery load store 145, a second periphery load store 155, a first periphery memory 147, a second periphery memory 157, a first plurality of dual FIFOs 149, and a second plurality of dual FIFOs 159, as described in U.S. Pat. Nos. 10,365,860, 10,691,464, and U.S. patent application Ser. No. 16/292,537, which are all incorporated herein in their entireties by this reference.

The integrated circuit 100 preferably functions to enable real-time and high computing efficiency of perception data and/or sensor data. A general configuration of the integrated circuit 100 includes a plurality of array cores 110 defining central signal and data processing nodes each having large register files that may eliminate or significantly reduce clock cycles needed by an array core 110 for pulling and pushing data for processing from memory. The instructions (i.e., computation/execution and data movement instructions) generating capabilities of the integrated circuit 100 (e.g., via the dispatcher 130 and/or a compiler module 175) function to enable a continuity and flow of data throughout the integrated circuit 100 and, namely, within the plurality of array cores 110 and border cores 120.

An array core 110 preferably functions as a data or signal processing node (e.g., a small microprocessor) or processing circuit and preferably includes a register file 112 having a large data storage capacity (e.g., 1024 kb, etc.) and an arithmetic logic unit (ALU) 118 or any suitable digital electronic circuit that performs arithmetic and bitwise operations on integer binary numbers. In a preferred embodiment, the register file 112 of an array core 110 may be the only memory element that the processing circuits of an array core 110 may have direct access to. An array core 110 may have indirect access to memory outside of the array core and/or the integrated circuit array 105 (i.e., core mesh) defined by the plurality of border cores 120 and the plurality of array cores 110.

The register file 112 of an array core 110 may be any suitable memory element or device, but preferably comprises one or more static random-access memories (SRAMs). The register file 112 may include a large number of registers, such as 1024 registers, that enables the storage of a sufficiently large data set for processing by the array core 110. Accordingly, a technical benefit achieved by an arrangement of the large register file 112 within each array core 110 is that the large register file 112 reduces a need by an array core 110 to fetch and load data into its register file 112 for processing. As a result, a number of clock cycles required by the array core 110 to push data into and pull data out of memory is significantly reduced or eliminated altogether. That is, the large register file 112 increases the efficiencies of computations performed by an array core 110 because most, if not all, of the data that the array core 110 is scheduled to process is located immediately next to the processing circuitry (e.g., one or more MACs, ALU, etc.) of the array core 110. For instance, when implementing image processing by the integrated circuit 100 or related system using a neural network algorithm(s) or application(s) (e.g., convolutional neural network algorithms or the like), the large register file 112 of an array core may function to enable a storage of all the image data required for processing an entire image. Accordingly, most, if not all, layer data of a neural network implementation (or similar compute-intensive application) may be stored locally in the large register file 112 of an array core 110 with the exception of weights or coefficients of the neural network algorithm(s), in some embodiments. This allows for optimal utilization of the computing and/or processing elements (e.g., the one or more MACs and ALU) of an array core 110 by enabling an array core 110 to constantly churn data of the register file 112 and, further, limiting the fetching and loading of data from an off-array core data source (e.g., main memory, periphery memory, etc.).

By comparison, to traverse a register file in a traditional system implemented by a GPU or the like, it is typically required that memory addresses be issued for fetching data from memory. However, in a preferred embodiment that implements the large register file 112, the (raw) input data within the register file 112 may be automatically incremented from the register file 112 and data from neighboring core(s) (e.g., array cores and/or border cores) are continuously sourced to the register file 112 to enable a continuous flow to the computing elements of the array core 110 without an express need for the array core 110 to make a request (or issue memory addresses).
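The contrast between address-issuing traversal and automatic incrementing can be sketched in software. The following is a rough illustration only, assuming a hypothetical `AutoIncrementRegisterFile` class whose read position advances on its own and which accepts data pushed from a neighbor; the consumer never supplies an address.

```python
from typing import Iterator, List


class AutoIncrementRegisterFile:
    """Hypothetical register file whose read position advances by itself."""

    def __init__(self, data: List[int]) -> None:
        self._data = list(data)
        self._read_pos = 0  # internal pointer; never exposed as an address

    def push_from_neighbor(self, value: int) -> None:
        """A neighboring core sources new data into this register file."""
        self._data.append(value)

    def stream(self) -> Iterator[int]:
        """Yield operands in order, incrementing the read position automatically."""
        while self._read_pos < len(self._data):
            value = self._data[self._read_pos]
            self._read_pos += 1
            yield value


def multiply_accumulate(operands: Iterator[int], weights: List[int]) -> int:
    """Consume streamed operands without issuing any memory address."""
    return sum(w * x for w, x in zip(weights, operands))


rf = AutoIncrementRegisterFile([1, 2, 3])
rf.push_from_neighbor(4)  # data arrives from a neighbor, unrequested
total = multiply_accumulate(rf.stream(), [1, 1, 1, 1])
```

The compute function only pulls the next operand; where that operand lives is tracked entirely inside the register file model.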

While in some embodiments of the present application, a predetermined data flow schedule may mitigate or altogether eliminate requests for data by components within the integrated circuit array 105, in a variant of these embodiments traditional random memory access may be achieved by components of the integrated circuit array 105. That is, if an array core 110 or a border core 120 recognizes a need for a random piece of data for processing, the array core 110 and/or the border core 120 may make a specific request for data from any of the memory elements within the memory hierarchy of the integrated circuit 100.

An array core 110 may, additionally or alternatively, include a plurality of multiplier (multiply) accumulators (MACs) 114 or any suitable logic devices or digital circuits that may be capable of performing multiply and summation functions. In a preferred embodiment, each array core 110 includes four (4) MACs and each MAC 114 may be arranged at or near a specific side of a rectangular shaped array core 110. While, in a preferred embodiment, each of the plurality of MACs 114 of an array core 110 may be arranged near or at the respective sides of the array core 110, it shall be known that the plurality of MACs 114 may be arranged within the array core 110 (or possibly augmented to a periphery of an array core) in any suitable arrangement, pattern, position, and the like including at the respective corners of an array core 110. In a preferred embodiment, the arrangement of the plurality of MACs 114 along the sides of an array core 110 enables efficient inflow or capture of input data received from one or more of the direct neighboring cores (i.e., an adjacent neighboring core) and the computation thereof by the array core 110 of the integrated circuit 100.

Accordingly, each of the plurality of MACs 114 positioned within an array core 110 may function to have direct communication capabilities with neighboring cores (e.g., array cores, border cores, etc.) within the integrated circuit 100. The plurality of MACs 114 may additionally function to execute computations using data (e.g., operands) sourced from the large register file 112 of an array core 110. However, the plurality of MACs 114 preferably function to source data for executing computations from one or more of their respective neighboring core(s) and/or a weights or coefficients (constants) bus 116 that functions to transfer coefficient or weight inputs of one or more algorithms (including machine learning algorithms) from one or more memory elements (e.g., main memory 160 or the like) or one or more input sources.

The weights bus 116 may be operably placed in electrical communication with at least one or more of periphery controllers 140, 150 at a first input terminal and additionally, operably connected with one or more of the plurality of array cores 110. In this way, the weights bus 116 may function to collect weights and coefficients data input from the one or more periphery controllers 140, 150 and transmit the weights and coefficients data input directly to one or more of the plurality of array cores 110. Accordingly, in some embodiments, multiple array cores 110 may be fed weights and/or coefficients data input via the weights bus 116 in parallel to thereby improve the speed of computation of the array cores 110.

Each array core 110 preferably functions to bi-directionally communicate with its direct neighbors. That is, in some embodiments, a respective array core 110 may be configured as a processing node having a rectangular shape and arranged such that each side of the processing node may be capable of interacting with another node (e.g., another processing node, a data storage/movement node, etc.) that is positioned next to one of the four sides or each of the faces of the array core 110. The ability of an array core 110 to bi-directionally communicate with a neighboring core along each of its sides enables the array core 110 to pull in data from any of its neighbors as well as push (processed or raw) data to any of its neighbors. This enables a mesh communication architecture that allows for efficient movement of data throughout the collection of array and border cores 110, 120 of the integrated circuit 100.
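The four-sided neighbor relationship of the mesh can be sketched as a small grid model. This is an illustration under assumptions, not the circuit's routing logic: the coordinate scheme, the side names, and the helper functions `direct_neighbors` and `can_exchange` are hypothetical.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]

# Offsets for the four sides of a rectangular core (side names are illustrative).
SIDES: Dict[str, Coord] = {
    "north": (-1, 0),
    "south": (1, 0),
    "west": (0, -1),
    "east": (0, 1),
}


def direct_neighbors(core: Coord, rows: int, cols: int) -> Dict[str, Coord]:
    """Return the in-bounds neighbor on each side of `core` in a rows x cols mesh."""
    r, c = core
    result: Dict[str, Coord] = {}
    for side, (dr, dc) in SIDES.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            result[side] = (nr, nc)
    return result


def can_exchange(a: Coord, b: Coord, rows: int, cols: int) -> bool:
    """Links are bi-directional: a can push to b exactly when b can push to a."""
    return b in direct_neighbors(a, rows, cols).values()


interior = direct_neighbors((1, 1), rows=3, cols=3)
corner = direct_neighbors((0, 0), rows=3, cols=3)
```

In this model an interior cell has a neighbor on all four sides, while a cell on the edge of the grid has fewer; in the described circuit the edge positions are occupied by border cores.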

Each of the plurality of border cores 120 preferably includes a register file 122. The register file 122 may be configured similar to the register file 112 of an array core 110 in that the register file 122 may function to store large datasets, as shown by way of example in FIG. 6. Preferably, each border core 120 includes a simplified architecture when compared to an array core 110. Accordingly, a border core 120 in some embodiments may not include execution capabilities and therefore, may not include multiplier-accumulators and/or an arithmetic logic unit as provided in many of the array cores 110.

In a traditional integrated circuit (e.g., a GPU or the like), when input image data (or any other suitable sensor data) is received for processing by a compute-intensive application (e.g., neural network algorithm) within such a circuit, it may be necessary to issue padding requests to areas within the circuit which do not include image values (e.g., pixel values) based on the input image data. That is, during image processing or the like, the traditional integrated circuit may function to perform image processing from a memory element that does not contain any image data value. In such instances, the traditional integrated circuit may function to request that a padding value, such as zero, be added to the memory element to avoid subsequent image processing efforts at the memory element without an image data value. A consequence of this typical image data processing by the traditional integrated circuit is a number of clock cycles spent identifying the blank memory element and adding a computable value to the memory element for image processing or the like by the traditional integrated circuit.

In a preferred implementation of the integrated circuit 100, one or more of the plurality of border cores 120 may function to automatically set to a default value when no input data (e.g., input sensor data) is received. For instance, input image data from a sensor (or another circuit layer) may have a total image data size that does not occupy all border core cells of the integrated circuit array 105. In such instance, upon receipt of the input image data, the one or more border cores 120 (i.e., border core cells) without input image data may be automatically set to a default value, such as zero or a non-zero constant value.

In some embodiments, the predetermined input data flow schedule generated by the dispatcher and sent to one or more of the plurality of border cores may include instructions to set to a default or a predetermined constant value. Additionally, or alternatively, the one or more border cores 120 may be automatically set to a default or a predetermined value when it is detected that no input sensor data or the like is received with a predetermined input data flow to the integrated circuit array 105. Additionally, or alternatively, in one variation, the one or more border cores 120 may be automatically set to reflect values of one or more other border cores having input sensor data when it is detected that no input sensor data or the like is received with a predetermined input data flow to the integrated circuit array 105.
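
By way of a non-limiting illustration only, the automatic padding behavior described above may be sketched in software as follows. The function name `fill_border_cells`, the mode names `constant` and `mirror_nearest`, and the use of `None` to mark a border cell that received no input data are assumptions made for this sketch and do not appear in the embodiments; in particular, "nearest populated cell" is only one possible reading of a border core reflecting the values of other border cores.

```python
from typing import List, Optional


def fill_border_cells(
    cells: List[Optional[float]],
    mode: str = "constant",
    default: float = 0.0,
) -> List[float]:
    """Fill border cells that received no input data (marked None).

    mode="constant":       empty cells take `default` (zero or a non-zero constant).
    mode="mirror_nearest": empty cells copy the nearest populated border cell
                           (hypothetical reading of the "reflect" variation).
    """
    if mode not in ("constant", "mirror_nearest"):
        raise ValueError(f"unknown padding mode: {mode}")

    populated = [i for i, value in enumerate(cells) if value is not None]
    filled: List[float] = []
    for i, value in enumerate(cells):
        if value is not None:
            filled.append(value)          # cell already holds input data
        elif mode == "constant" or not populated:
            filled.append(default)        # automatic default value
        else:
            nearest = min(populated, key=lambda j: abs(j - i))
            filled.append(cells[nearest])  # copy a neighbor that has data
    return filled


row = [None, 3.0, 5.0, None, None]
constant_fill = fill_border_cells(row)
mirrored_fill = fill_border_cells(row, mode="mirror_nearest")
```

In this sketch no padding request is issued by a consumer of the data; every empty cell already holds a computable value before any computation reads it.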

Accordingly, a technical benefit achieved according to the implementation of one or more of the plurality of border cores 120 as automatic padding elements may include increasing efficiencies in computation by one or more of the plurality of array cores 110 by minimizing work requests to regions of interest (or surrounding areas) of input sensor data where automatic padding values have been set, thereby reducing clock cycles used by the plurality of array cores 110 in performing computations on an input dataset.

In a preferred implementation of the integrated circuit 100, the progression of data into the plurality of array cores 110 and the plurality of border cores 120 for processing is preferably based on a predetermined data flow schedule generated at the dispatcher 130. The predetermined data flow schedule enables input data from one or more sources (e.g., sensors, other NN layers, an upstream device, etc.) to be loaded into the border cores 120 and array cores 110 without requiring an explicit request for the input data from the border cores 120 and/or array cores 110. That is, the predetermined data flow schedule enables an automatic flow of raw data from memory elements (e.g., main memory 160) of the integrated circuit 100 to the plurality of border cores 120 and the plurality of array cores 110 having capacity to accept data for processing. For instance, in the case that an array core 110 functions to process a first subset of data of a data load stored in its register file 112, once the processing of the first subset of data is completed and the results are sent out from the array core 110, the predetermined data flow schedule may function to enable an automatic flow of raw data into the array core 110 that adds to the data load at the register file 112 and replaces the first subset of data that was previously processed by the array core 110. Accordingly, in such instance, no explicit request for additional raw data for processing is required from the array core 110. Rather, the integrated circuit 100 implementing the dispatcher 130 may function to recognize that once the array core 110 has processed some amount of data sourced from its register file 112 (or elsewhere), the array core 110 may have additional capacity to accept additional data for processing.

In a preferred embodiment, the integrated circuit 100 may be in operable communication with an instructions generator 170 that functions to generate computation, execution, and data movement instructions, as shown by way of example in FIGS. 3A-3B. The instructions generator 170 may be arranged off-chip relative to the components and circuitry of the integrated circuit 100. However, in alternative embodiments, the instructions generator 170 may be cooperatively integrated within the integrated circuit 100 as a distinct or integrated component of the dispatcher 130.

Preferably, the instructions generator 170 may be implemented using one or more general purpose computers (e.g., a Mac computer, Linux computer, or any suitable hardware computer) or general-purpose computer processing (GPCP) units 171 that function to operate a compiler module 175 that is specifically configured to generate multiple and/or disparate types of instructions. The compiler module 175 may be implemented using any suitable compiler software (e.g., a GNU Compiler Collection (GCC), a Clang compiler, and/or any suitable open-source compiler or other compiler). The compiler module 175 may function to generate at least computation instructions and execution instructions as well as data movement instructions. In a preferred embodiment, at compile time, the compiler module 175 may be executed by the one or more GPCP units 171 to generate the two or more sets of instructions (computation/execution instructions and data movement instructions) sequentially or in parallel. In some embodiments, the compiler module 175 may function to synthesize multiple sets of disparate instructions into a single composition instruction set that may be loaded into memory (e.g., instructions buffer, an external DDR, SPI flash memory, or the like) from which the dispatcher may fetch and execute the single composition instruction set.

In a first variation, however, once the compiler module 175 generates the multiple disparate sets of instructions, such as computation instructions and data movement instructions, the instructions generator 170 may function to load the instruction sets into a memory (e.g., memory 160 or off-chip memory associated with the generator 170). In such embodiments, the dispatcher 130 may function to fetch the multiple sets of disparate instructions generated by the instructions generator 170 from memory and synthesize the multiple sets of disparate instructions into a single composition instruction set that the dispatcher may execute and/or load within the integrated circuit 100.

In a second variation, the dispatcher 130 may be configured with compiling functionality to generate the single composition instruction set. In such variation, the dispatcher 130 may include processing circuitry (e.g., microprocessor or the like) that functions to create instructions that include scheduled computations or executions to be performed by various circuits and/or components (e.g., array core computations) of the integrated circuit 100 and, further, to create instructions that enable control of a flow of input data through the integrated circuit 100. In some embodiments, the dispatcher 130 may function to execute part of the instructions and load another part of the instructions into the integrated circuit array 105. In general, the dispatcher 130 may function as a primary controller of the integrated circuit 100 that controls and manages access to a flow (movement) of data from memory to the one or more other storage and/or processing circuits of the integrated circuit 100 (and vice versa). Additionally, the dispatcher 130 may schedule control execution operations of the various sub-controllers (e.g., periphery controllers, etc.) and the plurality of array cores 110.

In some embodiments, the processing circuitry of the dispatcher 130 includes disparate circuitry including a compute instructions generator circuit 132 and a data movement instructions generator circuit 134 (e.g., address generation unit or address computation unit) that may independently generate computation/execution instructions and data transfer/movement schedules or instructions, respectively. Accordingly, this configuration enables the dispatcher 130 to perform data address calculation and generation of computation/execution instructions in parallel. The dispatcher 130 may function to synthesize the output from both the compute instructions generator circuit 132 and the data movement instructions generator circuit 134 into a single instructions composition that combines the disparate outputs.

The single instructions composition generated by the instructions generator 170 and/or the dispatcher 130 may be provided to the one or more downstream components and integrated circuit array 105 and allow for computation or processing instructions and data transfer/movement instructions to be performed simultaneously by these various circuits or components of the integrated circuit 100. With respect to the integrated circuit array 105, the data movement component of the single instructions composition may be performed by one or more of periphery controllers 140, 150 and compute instructions by one or more of the plurality of array cores 110. Accordingly, in such embodiment, the periphery controllers 140, 150 may function to decode the data movement component of the instructions and, if involved, may perform operations to read from or write to the dual FIFOs 149, 159 and move that data from the dual FIFOs 149, 159 onto a data bus to the integrated circuit (or vice versa). It shall be understood that the read or write operations performed by periphery controllers 140, 150 may be performed sequentially or simultaneously (i.e., writing to and reading from dual FIFOs at the same time).

It shall be noted that while the compute instructions generator circuit 132 and the data movement instructions generator circuit 134 are preferably separate or independent circuits, in some embodiments the compute instructions generator circuit 132 and the data movement instructions generator circuit 134 may be implemented by a single circuit or a single module that functions to perform both compute instructions generation and data movement instruction generation.

In operation, the dispatcher 130 may function to generate and schedule memory addresses to be loaded into one or more of the periphery load store 145 and the periphery load store 155. The periphery load stores 145, 155 preferably include specialized execution units that function to execute all load and store instructions from the dispatcher 130 and may generally function to load or fetch data from memory or to store the data back to memory from the integrated array core. The first periphery load store 145 preferably communicably and operably interfaces with both the first plurality of dual FIFOs 149 and the first periphery memory 147. The first and the second periphery memory 147, 157 preferably comprise on-chip static random-access memory.

In configuration, the first periphery load store 145 may be arranged between the first plurality of dual FIFOs 149 and the first periphery memory 147 such that the first periphery load store 145 is positioned immediately next to or behind the first plurality of dual FIFOs 149. Similarly, the second periphery load store 155 preferably communicably and operably interfaces with both the second plurality of dual FIFOs 159 and the second periphery memory 157. Accordingly, the second periphery load store 155 may be arranged between the second plurality of dual FIFOs 159 and the second periphery memory 157 such that the second periphery load store 155 is positioned immediately next to or behind the second plurality of dual FIFOs 159.

In response to memory addressing instructions issued by the dispatcher 130 to one or more of the first and the second periphery load stores 145, 155, the first and the second periphery load stores 145, 155 may function to execute the instructions to fetch data from one of the first periphery memory 147 and the second periphery memory 157 and move the fetched data into one or more of the first and second plurality of dual FIFOs 149, 159. Additionally, or alternatively, the dual FIFOs 149, 159 may function to read data from a data bus and move the read data to one or more of the respective dual FIFOs or read data from one or more of the dual FIFOs and move the read data to a data bus. Similarly, memory addressing instructions may cause one or more of the first and the second periphery load stores 145, 155 to move data collected from one or more of the plurality of dual FIFOs 149, 159 into one of the first and second periphery memory 147, 157.

Each of the first plurality of dual FIFOs 149 and each of the second plurality of dual FIFOs 159 preferably comprises at least two memory elements (not shown). Preferably, the first plurality of dual FIFOs 149 may be arranged along a first side of the integrated circuit array 105 with each of the first plurality of dual FIFOs 149 being aligned with a row of the integrated circuit array 105. Similarly, the second plurality of dual FIFOs 159 may be arranged along a second side of the integrated circuit array 105 with each of the second plurality of dual FIFOs 159 being aligned with a column of the integrated circuit array 105. This arrangement preferably enables each border core 120 along the first side of the integrated circuit array 105 to communicably and operably interface with at least one of the first periphery controllers 145 and each border core 120 along the second side of the integrated circuit array 105 to communicably and operably interface with at least one of the second periphery controllers 155.

While it is illustrated in at least FIG. 1 that there are a first and second plurality of dual FIFOs, first and second periphery controllers, first and second periphery memories, and first and second load stores, it shall be noted that these structures may be arranged to surround an entire periphery of the integrated circuit array 105 such that, for instance, these components are arranged along all (four) sides of the integrated circuit array 105.

The dual FIFOs 149, 159 preferably function to react to specific instructions for data from their respective side. That is, the dual FIFOs 149, 159 may be configured to identify data movement instructions from the dispatcher 130 that are specific to either the first plurality of dual FIFOs 149 along the first side or the second plurality of dual FIFOs 159 along the second side of the integrated circuit array 105.

According to a first implementation, each of the dual FIFOs may use a first of the two memory elements to push data into the integrated circuit array 105 and a second of the two memory elements to pull data from the integrated circuit array 105. Thus, each dual FIFO 149, 159 may have a first memory element dedicated for moving data inward into the integrated circuit array 105 and a second memory element dedicated for moving data outward from the integrated circuit array 105.
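
As a minimal software sketch of the first implementation, and not a description of the actual circuit, a dual FIFO may be modeled as two independent queues, one per direction. The class name `DualFifo` and its method names are hypothetical and were chosen for this sketch only.

```python
from collections import deque
from typing import Deque


class DualFifo:
    """Toy model: one memory element per direction of data movement."""

    def __init__(self) -> None:
        self.inbound: Deque[int] = deque()   # data headed into the array
        self.outbound: Deque[int] = deque()  # data collected from the array

    def load_from_bus(self, value: int) -> None:
        """Queue a value that will later be pushed into the array."""
        self.inbound.append(value)

    def push_into_array(self) -> int:
        """Release the oldest inbound value toward the array."""
        return self.inbound.popleft()

    def pull_from_array(self, value: int) -> None:
        """Collect a value produced by the array."""
        self.outbound.append(value)

    def drain_to_bus(self) -> int:
        """Release the oldest collected value toward the data bus."""
        return self.outbound.popleft()


fifo = DualFifo()
fifo.load_from_bus(10)
fifo.load_from_bus(20)
fifo.pull_from_array(99)            # outbound traffic does not disturb inbound order
first_in = fifo.push_into_array()
first_out = fifo.drain_to_bus()
```

Because the two queues are separate, a value can be written into the array and another read out of it without either direction blocking the other in this model.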

According to a second implementation, the dual FIFOs may be operated in a stack (second) mode in which each respective dual FIFO functions to provide data into the integrated circuit array 105 in a predetermined sequence or order and collect the data from the integrated circuit array 105 in the same predetermined sequence or order.

Additionally, the integrated circuit 100 preferably includes main memory 160 comprising a single unified memory. The main memory 160 preferably functions to store data originating from one or more sensors, system-derived or generated data, data from one or more integrated circuit layers, data from one or more upstream devices or components, and the like. Preferably, the main memory 160 comprises on-chip static random-access memory or the like.

Additionally, or alternatively, main memory 160 may include multiple levels of on-die (on-chip) memory. In such embodiments, the main memory 160 may include multiple memory (e.g., SRAM) elements that may be in electrical communication with each other and function as a single unified memory that is arranged on a same die as the integrated circuit array 105.

Additionally, or alternatively, main memory 160 may include multiple levels of off-die (off-chip) memory (not shown). In such embodiments, the main memory 160 may include multiple memory (e.g., DDR SRAM, high bandwidth memory (HBM), etc.) elements that may be in electrical communication with each other and function as a single unified memory that is arranged on a separate die from the integrated circuit array 105.

It shall be noted that in some embodiments, the integrated circuit 100 includes main memory 160 comprising memory arranged on-die and off-die. In such embodiments, the on-die and the off-die memory of the main memory 160 may function as a single unified memory accessible to the on-die components of the integrated circuit 100.

Each of the first periphery memory 147 and the second periphery memory 157 may port into the main memory 160. Between the first periphery memory 147 and the main memory 160 may be arranged a load store unit that enables the first periphery memory 147 to fetch data from the main memory 160. Similarly, between the second periphery memory 157 and the main memory 160 may be arranged a second load store unit that enables the second periphery memory 157 to fetch data from the main memory 160.

It shall be noted that the data transfers along the memory hierarchy of the integrated circuit 100 occurring between dual FIFOs 149, 159 and the load stores 145, 155, between the load stores 145, 155 and the periphery memory 147, 157, and between the periphery memory 147, 157 and the main memory 160 may preferably be implemented as prescheduled or predetermined direct memory access (DMA) transfers that enable the memory elements and load stores to independently access and transfer data within the memory hierarchy without direct intervention of the dispatcher 130 or some main processing circuit. Additionally, the data transfers within the memory hierarchy of the integrated circuit 100 may be implemented as 2D DMA transfers having two counts and two strides, thereby allowing for efficient data access and data reshaping during transfers. In a preferred embodiment, the DMA transfers may be triggered by a status or operation of one or more of the plurality of array cores 110. For instance, if an array core is completing or has completed a processing of a first set of data, the completion or near-completion may trigger the DMA transfers to enable additional data to enter the integrated circuit array 105 for processing.
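
For illustration only, the addressing pattern of a 2D DMA transfer having two counts and two strides may be sketched as follows. The function names, the flat-list memory model, and the parameter order are assumptions made for this sketch; the embodiments do not specify a descriptor layout.

```python
from typing import List, Sequence


def dma_2d_addresses(
    base: int,
    count_outer: int,
    stride_outer: int,
    count_inner: int,
    stride_inner: int,
) -> List[int]:
    """Enumerate source addresses visited by a two-count, two-stride transfer."""
    return [
        base + i * stride_outer + j * stride_inner
        for i in range(count_outer)
        for j in range(count_inner)
    ]


def dma_2d_transfer(
    src: Sequence[int],
    base: int,
    count_outer: int,
    stride_outer: int,
    count_inner: int,
    stride_inner: int,
) -> List[int]:
    """Gather the addressed elements into a contiguous destination buffer."""
    addresses = dma_2d_addresses(
        base, count_outer, stride_outer, count_inner, stride_inner
    )
    return [src[a] for a in addresses]


# A 4x4 matrix stored row-major in a flat memory: element (r, c) sits at 4*r + c.
memory = list(range(16))

# Extract the 2x2 tile whose top-left element is at address 5.
tile = dma_2d_transfer(memory, base=5, count_outer=2, stride_outer=4,
                       count_inner=2, stride_inner=1)

# Reshape during transfer: swapping the strides yields the transpose.
transposed = dma_2d_transfer(memory, base=0, count_outer=4, stride_outer=1,
                             count_inner=4, stride_inner=4)
```

The same pair of counts and strides therefore covers both sub-block access and data reshaping, which is the efficiency the 2D form is described as providing.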

2. Method for Accelerating Computations and Operations of a Processing Core

As shown by way of example in FIG. 2, a method 200 for a fetchless acceleration of computational loops and operations of a processing core includes computing FSM initialization parameters S210, encoding a programmable FSM using FSM initialization parameters S220, executing the programmable FSM S230, and performing FSM-controlled fetchless computations S240.

2.10 Computing FSM Initialization Parameters

S210, which includes identifying FSM initialization parameters, may function to determine each of a plurality of distinct FSM initialization parameters for encoding and/or programming a target programmable FSM of an integrated circuit. In one or more embodiments, S210 may function to generate the FSM initialization parameters based at least on attributes of neural network computations of a neural network program or application. Additionally, or alternatively, in some embodiments, S210 may function to identify a subset of the FSM initialization parameters based at least on attributes of an n-dimensional matrix or vector of data, such as an n-dimensional tensor.

In one or more embodiments, the FSM initialization parameters may relate to a set of distinct initialization or parameter values that may be encoded to one or more circuits of a programmable FSM and that enable the programmable FSM to control operations of computational circuits of one or more processing cores for an accelerated and reduced power operation of the processing cores when performing computations on data.

In one embodiment, identifying the FSM initialization parameters may include implementing a neural network compiler or similar software application that preferably functions to generate instructions for the programmable FSM and various other circuits of an integrated circuit. In such embodiments, at compile time, a software compiler may take in as input a neural network graph or neural network task graph (or the like) to generate the FSM initialization parameters/instructions. In one implementation, once the software compiler generates the FSM initialization parameters, S210 may function to store the FSM initialization parameters in an instructions buffer and, at runtime, the FSM initialization parameters may be encoded to the programmable FSM.

In one or more embodiments, generating the FSM initialization parameters includes generating a starting memory address parameter for each computation circuit (e.g., multiply accumulator circuit of a processor core) of a processing core. The starting memory address parameter preferably relates to a memory address pointer (e.g., register file pointer), such as a tensor offset, that identifies a head of input data within an n-dimensional matrix or vector being stored on a local memory of a processing core. In one embodiment, a target programmable FSM may be in control signal communication with a plurality of distinct processing cores, each distinct processing core having a local memory (or a register file relatively larger than registers or data ports or the like) and one or more multiply accumulator circuits (MACs).

In one or more embodiments, generating the FSM initialization parameters includes generating a convolutional type parameter for each computation circuit of a processing core. The convolutional type parameter preferably relates to a convolutional filter size (e.g., 3×3, 5×5, and the like). In a preferred embodiment, the convolutional type parameter may be a value (e.g., 0, 1, 2, 3, and the like) that maps to a distinct convolutional filter size. As a non-limiting example, a convolutional type parameter value of 1 may map to a 3×3 convolution filter size and a value of 2 may map to a 5×5 convolution filter size. It shall be recognized that the convolutional type parameter values may be mapped to a convolution filter of any size.
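
The mapping in the non-limiting example above may be sketched as a simple lookup. Only the two entries given in that example (1 to 3×3 and 2 to 5×5) are taken from the description; the names `CONV_TYPE_TO_FILTER`, `decode_conv_type`, and `macs_per_output`, and the choice to reject unmapped values, are assumptions of this sketch.

```python
from typing import Dict, Tuple

# Convolutional type parameter value -> (filter height, filter width).
# Only the two mappings stated in the example are included here.
CONV_TYPE_TO_FILTER: Dict[int, Tuple[int, int]] = {
    1: (3, 3),
    2: (5, 5),
}


def decode_conv_type(value: int) -> Tuple[int, int]:
    """Return the filter size that a convolutional type parameter maps to."""
    try:
        return CONV_TYPE_TO_FILTER[value]
    except KeyError:
        raise ValueError(f"no filter size mapped to type value {value}") from None


def macs_per_output(value: int) -> int:
    """Multiply-accumulates per output element for one input channel."""
    height, width = decode_conv_type(value)
    return height * width


size_for_type_1 = decode_conv_type(1)
macs_for_type_2 = macs_per_output(2)
```

Storing a small integer rather than the filter dimensions keeps the encoded value to a few bits, and the product of the decoded dimensions gives the per-output multiply-accumulate count for a single channel.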

In one or more embodiments, generating the FSM initialization parameters may include generating iteration parameters for each computation circuit of a processing core. The iteration parameters, in some embodiments, may include at least two distinct parameters including a first iteration parameter (e.g., center channel loop) that may inform a computation circuit of a number of instances, cycles, or loops that the computation circuit should perform a multiply-accumulate computation and a second iteration parameter (e.g., a data rotation loop) that may additionally inform a movement of input data from a local memory (e.g., register file of processing core) to the data ports of its neighboring processing cores for computation, and a number of computations to be performed by the computation circuit.

Additionally, or alternatively, in one or more embodiments, the second iteration parameter or data rotation loop parameter may include at least two distinct parameter components or values and a required executional sequence. In a non-limiting example, the second iteration parameter may include a first parameter value that informs a type of data movement or rotation together with a starting data movement location, e.g., [East 180°, second parameter value]. In such an embodiment, an indication of a starting location or starting data port (e.g., East data port) for data movement may be indicated using cardinal directions together with an indication of a degree of rotation of input data stored within the starting data port to a destination data port of a processing core. It shall be recognized that the first parameter value could be separated into two distinct parameter values that operate together to identify a storage location of target input data and a required movement of the target input data to a particular neighboring processing core. The second parameter value of the second iteration parameter may include a number of cycles of a (convolutional) computation (e.g., [East 180°, 7]) to be performed by a given computation circuit (e.g., a MAC unit). In some embodiments, the second iteration parameter and/or the first iteration parameter may additionally include a MAC unit identifier, such as East, North, West, South MAC unit or the like, which depends on a configuration of a given processing core. It shall be recognized that any suitable identifier may be used.
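
One possible software representation of the second iteration parameter, offered as a sketch under stated assumptions, is shown below. The class name `RotationParam`, the restriction to multiples of 90°, and the clockwise ordering North, East, South, West used to resolve a destination port are all assumptions of this sketch; the description does not fix a rotation direction.

```python
from dataclasses import dataclass

# Assumed clockwise ordering of the data ports of a processing core.
PORTS = ("North", "East", "South", "West")


@dataclass(frozen=True)
class RotationParam:
    """Data rotation loop parameter, e.g. [East 180°, 7]."""

    start_port: str   # starting data port, given as a cardinal direction
    degrees: int      # degree of rotation applied to the stored input data
    cycles: int       # number of computation cycles for the MAC unit

    def destination_port(self) -> str:
        """Resolve the destination data port (clockwise, 90-degree steps)."""
        if self.start_port not in PORTS:
            raise ValueError(f"unknown port: {self.start_port}")
        if self.degrees % 90 != 0:
            raise ValueError("this sketch supports multiples of 90 degrees only")
        steps = self.degrees // 90
        return PORTS[(PORTS.index(self.start_port) + steps) % len(PORTS)]


param = RotationParam(start_port="East", degrees=180, cycles=7)
destination = param.destination_port()
```

A 180° rotation resolves to the opposite port regardless of the direction convention, so the [East 180°, 7] example does not depend on the clockwise assumption; other angles do.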

2.20 Encoding/Programming the Programmable FSM

S220, which includes initializing the programmable FSM, may function to encode a programmable FSM using at least the FSM initialization parameters. In a preferred embodiment, the programmable FSM may be implemented by a plurality of distinct register files or processor registers. Accordingly, at runtime, S220 may function to program or encode the register files of the programmable FSM using the FSM initialization parameters as encoding values.

In a preferred embodiment, initializing or configuring the programmable FSM may include encoding each distinct FSM initialization parameter to a distinct register file or a distinct set of registers of the programmable FSM. In such preferred embodiment, the programmable FSM may include one or more of a start address register, convolution type register, and iteration registers. Each of the start address register, the convolution type register, and iteration registers may be implemented by the programmable FSM to generate command signals, control signals, and programming signals to one or more computation circuits of one or more processing cores, as shown by way of example in FIG. 4.

Memory Start Address|Start Address Register

In a preferred implementation, the programmable FSM comprises a register file configured to store a memory start address value or pointer. The register file may be referred to herein as a start address register file. Accordingly, at runtime or the like, initializing the programmable FSM may include programming the start address register file and, in such embodiments, S220 may function to encode the start address register file of the programmable FSM using a start address value of the FSM initialization parameters. Once encoded, in one or more embodiments, the start address register file of the programmable FSM may include a memory address location or offset of a tensor that identifies a head of the input data for a given convolution computation by a processing core.

In a second implementation, at runtime, S220 may function to directly encode a local register file of a processing core with the start address parameter value. In such embodiments, S220 may function to bypass encoding the programmable FSM with the start address parameter value and distribute the start address parameter value or instructions to each of the final or end targets of the start address parameter instructions.

In one or more embodiments, a programmable FSM may include a plurality of distinct start address register files based on a number of processing cores with which the programmable FSM may be in control or command signal communication. That is, the programmable FSM may be hard wired to each of the multiple distinct processing cores and may function to control computations across the multiple distinct processing cores. As such, in these embodiments, the programmable FSM may include a distinct start address register file for each of the multiple distinct processing cores. Accordingly, S220 may function to encode each of the plurality of distinct start address register files of the programmable FSM with a distinct start address parameter value for distinct computations of each of the multiple distinct processing cores.

Convolution Type Parameter|Convolution Type Register

In a preferred embodiment, the programmable FSM comprises a register file configured to store a convolution type parameter value. The register file may be referred to herein as a convolution register file. At runtime, in such preferred embodiment, S220 may function to encode the convolution register file using a convolution parameter value of the FSM initialization parameters. In an encoded state, the convolution register file of the programmable FSM may include a single bit or n-bit value that maps to one of a plurality of distinct convolution filter sizes.

In one or more embodiments, a programmable FSM may include a plurality of distinct convolution register files based on a number of processing cores or a number of computing circuits (e.g., a number of MACs) with which the programmable FSM may be in control or command signal communication. That is, the programmable FSM may be hard wired to each of the multiple distinct processing cores or MACs and may function to control computations across the multiple distinct processing cores. As such, in these embodiments, the programmable FSM may include a distinct convolution register file for each of the multiple distinct processing cores. Accordingly, S220 may function to encode each of the plurality of distinct convolution register files of the programmable FSM with a distinct convolution type parameter value for distinct computations of each of the multiple distinct processing cores.

Center Channel Iterations+Data Rotation Iterations|Iteration Registers

In a preferred embodiment, the programmable FSM comprises a set of register files configured to store iteration parameter values. In such preferred embodiment, the set of register files may include at least two register files including a first register file (e.g., center channel register file) and a second register file (e.g., rotation register file). In one or more embodiments, at runtime, S220 may function to encode the first register file with a center channel parameter value that identifies a number of loops or instances of a multiply-accumulate computation for a given convolution that should be executed by reading data from a local memory of a processing core. Additionally, or alternatively, at runtime, S220 may function to encode the second register file with a rotation parameter value that may identify a series of operations by a processing core including a number of rotations or movements of data from neighboring processing cores or data ports, as shown by way of example in FIG. 5, to the input data ports of a processing core and a number of loops of multiply-accumulate computations on this input data of a given processing core.

In one or more embodiments, a programmable FSM may include a plurality of distinct iteration register files based on a number of processing cores with which the programmable FSM may be in control or command signal communication. That is, the programmable FSM may be hard wired to each of the multiple distinct processing cores or MACs and may function to control a number of iterations of computations across the multiple distinct processing cores. As such, in these embodiments, the programmable FSM may include a distinct iteration register file for each of the multiple distinct processing cores. Accordingly, S220 may function to encode each of the plurality of distinct iteration register files of the programmable FSM with a distinct iteration parameter value for distinct computations of each of the multiple distinct processing cores.
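
As a rough software analogy of S220, and not a description of the register hardware, the per-core register files may be modeled as one record per processing core that is written once at runtime. The class names `CoreRegisters` and `ProgrammableFsm`, the field names, the integer core identifiers, and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CoreRegisters:
    """Register files the FSM holds for one processing core."""

    start_address: int = 0         # start address register file
    conv_type: int = 0             # convolution register file
    center_channel_iters: int = 0  # first iteration register file
    rotation_iters: int = 0        # second iteration register file


@dataclass
class ProgrammableFsm:
    """Holds a distinct set of register files per controlled core."""

    registers: Dict[int, CoreRegisters] = field(default_factory=dict)

    def encode(self, core_id: int, **params: int) -> None:
        """Write FSM initialization parameters for one core (S220)."""
        self.registers[core_id] = CoreRegisters(**params)


fsm = ProgrammableFsm()
# Each core receives its own, possibly different, parameter values.
fsm.encode(0, start_address=0, conv_type=1,
           center_channel_iters=9, rotation_iters=4)
fsm.encode(1, start_address=64, conv_type=2,
           center_channel_iters=25, rotation_iters=6)
```

Keeping a separate record per core mirrors the described arrangement in which distinct computations on distinct cores are driven from distinct register files of a single FSM.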

2.30 Controlling Convolutions|Executing the Programmable FSM

S230, which includes executing the programmable FSM, may function to execute or implement the programmable FSM to control operations of one or more processing cores of an integrated circuit in a fetchless manner. In a preferred embodiment, once the programmable FSM is encoded with the FSM initialization parameters, S230 may function to execute a start instruction that enables the programmable FSM to compute and broadcast control signals to each of the processing cores under its control without fetching instructions from memory (e.g., an instructions buffer or the like). That is, in such preferred embodiments, an enablement and/or start of the programmable FSM is directly predicated on an initial encoding of the programmable FSM with the FSM initialization parameters.

In use, the programmable FSM may function to produce control signals or commands that inform computational and/or data movement (datapath) operations of a target processing core without fetching instructions. Accordingly, an operation of the programmable FSM may be a fetchless operation in which the computation control encoding of the programmable FSM together with a locality of data stored on the processing cores of an array of processing cores mitigates or eliminates a requirement for the processing cores or related circuits to perform fetches of computation instructions (e.g., reads, writes, compute, etc.) from memory and fetches of input data.

In a preferred embodiment, once the programmable FSM is encoded with FSM initialization parameters, the components of a processing core controlled by the programmable FSM may function to complete a full convolutional computation or the like, which may include multiple loops or iterations of a convolution, without intervening or contemporaneous fetches of instructions. That is, with an operation of the programmable FSM for controlling the operations of the processing cores, the operations of the processing cores may be fetchless operations not dependent on periodically fetching instructions but dependent on one or more states of the programmable FSM and FSM initialization parameters stored in the register files of the programmable FSM.
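
The fetchless control flow described above may be illustrated, in a simplified and hedged form, by a generator that derives one control signal per computation cycle purely from previously encoded parameters. The names `FsmParams`, `run_fsm`, `MAC_LOCAL`, and `ROTATE_AND_MAC`, the two-phase ordering, and the unit address increment are assumptions of this sketch rather than features stated in the embodiments.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple

# (cycle index, operation, local memory address or None)
ControlSignal = Tuple[int, str, Optional[int]]


@dataclass(frozen=True)
class FsmParams:
    """Values assumed to be already encoded in the FSM register files."""

    start_address: int
    center_channel_iters: int
    rotation_iters: int


def run_fsm(params: FsmParams) -> Iterator[ControlSignal]:
    """Emit a distinct control signal per cycle; no instruction is fetched."""
    cycle = 0
    address = params.start_address

    # Phase 1: multiply-accumulate on data read from local memory.
    for _ in range(params.center_channel_iters):
        yield (cycle, "MAC_LOCAL", address)
        address += 1   # assumed unit stride through the local memory
        cycle += 1

    # Phase 2: rotate data in from neighboring ports, then accumulate.
    for _ in range(params.rotation_iters):
        yield (cycle, "ROTATE_AND_MAC", None)
        cycle += 1


signals = list(run_fsm(FsmParams(start_address=16,
                                 center_channel_iters=3,
                                 rotation_iters=2)))
```

Every signal in the sequence is a function of the loop counters and the encoded parameters alone, so the loop runs to completion without consulting an instruction buffer, which is the sense in which the operation is fetchless.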

2.40 Fetchless Computations

S240, which includes performing fetchless computations, may function to fetchlessly execute a plurality of distinct computations and data movements by at least one processing core of an integrated circuit based on control signals generated by the programmable FSM. In a preferred embodiment, an implementation of the programmable FSM may be initiated by an execution of a distinct FSM start instruction. In one embodiment, the FSM start instruction includes a Boolean switch or the like that may be used to initiate an initial computation, such as a convolution computation.

Reading Memory Start Address

In one or more embodiments, in response to an execution of the FSM start instruction or execution of a Boolean switch, a computation circuit (e.g., a MAC) may function to read input data beginning at a start memory address or starting offset location of a tensor based on the start memory parameter value of the memory address register file of the programmable FSM. In a preferred embodiment, the memory address register file of the programmable FSM may include a pointer to a head of input data stored in a local memory of a processing core (i.e., on-processing core memory). In such a preferred embodiment, the processing core may include one or more computation circuits, such as multiply accumulators, that may be in direct read-access communication with a local memory of the processing core. In one or more embodiments, an n-dimensional matrix or vector of data, such as a tensor of data, may be stored on the local memory of the processing core. A read of the memory start address parameter or memory address pointer by the computation circuit informs the computation circuit of a position within the n-dimensional data structure at which to begin reading in data for iterative computations, such as convolutions. Depending on a convolution filter size and a size of the n-dimensional data to be read, the computation circuit of the processing core may perform multiple iterations of a computation, such as multiple convolutional computations.
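As a non-limiting sketch of how a start pointer and a filter size together may determine the number of iterations, the following models the local memory as a two-dimensional list and reads successive windows beginning at the pointer. The function name `read_windows`, the (row, column) form of the pointer, and the use of a stride of one with no padding are assumptions made for illustration and are not specified above.

```python
def read_windows(local_memory, start, k):
    """Yield k x k windows of a 2-D tensor, beginning at the (row, col) start pointer.

    Assumes stride 1 and no padding; both are illustrative assumptions.
    """
    rows, cols = len(local_memory), len(local_memory[0])
    r0, c0 = start
    for r in range(r0, rows - k + 1):
        for c in range(c0, cols - k + 1):
            # Slice k consecutive rows, and k consecutive columns of each row.
            yield [row[c:c + k] for row in local_memory[r:r + k]]


# A 5 x 5 tensor held in (modeled) core-local memory.
tensor = [[r * 5 + c for c in range(5)] for r in range(5)]

windows_from_head = list(read_windows(tensor, start=(0, 0), k=3))
windows_from_offset = list(read_windows(tensor, start=(1, 1), k=3))
```

Under these assumptions, a 3×3 filter over a 5×5 tensor read from the head yields nine windows, and therefore nine iterations, while starting at offset (1, 1) yields four.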

Selecting or Implementing Parameter-Informed Convolution Filter Size

Additionally, or alternatively, based on an implementation of the programmable FSM or execution of the FSM start instructions, S240 may function to cause a computation circuit to read the convolution register file of the programmable FSM to identify a convolution filter size for performing computations on input data. In one or more embodiments, the convolution register file may include one of a plurality of distinct convolution type parameter values that each map to a distinct convolution filter size (e.g., 3×3, 5×5, 7×7, or the like). Upon reading the register file, the computation circuit may perform convolutional computations using the distinct convolution filter size based on the convolution type parameter value.

In an alternative implementation, the programmable FSM may function to send a convolution type command or control signal to a computation circuit of a processing core based on the convolution parameter value of the convolution register file of the programmable FSM.

It shall be recognized that, in one or more embodiments, the computation circuits, such as the MACs, of a processing core may be hardcoded (i.e., hardware or circuits specifically configured) to perform convolutional computations with one or more distinct convolution filter sizes. In one embodiment, a computation circuit or MAC of a processing core may be hardcoded to perform convolutional computations using a single convolution filter size. In such embodiments, if the read of the convolution register file or a convolution control signal from the programmable FSM includes a convolution parameter value that maps to the computation circuit's convolution filter size, the computation circuit may function to automatically perform the convolutional computation on an input dataset; otherwise, if the convolution parameter value or control signal does not map to or match the computation circuit's convolution filter size, the computation circuit may not write out its results or may bypass the computation (e.g., maintain an idle state).

Additionally, or alternatively, in one or more embodiments, in response to reading the convolution register file of the programmable FSM or receiving a convolution control signal from the programmable FSM, S240 may function to cause a computation circuit of a target processing core to toggle to an intended convolution filter size based on the convolution parameter value derived from the convolution control signal or read from the convolution register file of the programmable FSM. In such embodiments, the computation circuit, such as a MAC, may be hardcoded to perform multiple distinct convolutional computations. Accordingly, based on identifying a convolution parameter value or instruction from the programmable FSM, S240 may function to cause the MAC to select a distinct state of operation of a plurality of distinct states of operation or a distinct convolution filter size of a plurality of distinct convolution filter sizes that maps to the convolution parameter value associated with the convolution control signal or convolution register file.

Accordingly, in response to either reading the convolution register file of the programmable FSM or receiving the convolution control signal from the programmable FSM, the computation circuit may function to perform a convolution computation or the like using a selected convolutional filter size.
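The two behaviors described above, namely a single-size computation circuit that either computes or idles and a multi-size computation circuit that toggles to the indicated filter size, may be sketched as follows. This is an illustrative model only: the numeric encoding in `FILTER_SIZES` and the class names `SingleSizeMac` and `MultiSizeMac` are hypothetical, since the actual parameter encoding is not specified.

```python
# Hypothetical encoding of convolution type parameter values to filter sizes.
FILTER_SIZES = {0: 3, 1: 5, 2: 7}


class SingleSizeMac:
    """Models a MAC hardcoded for exactly one convolution filter size."""

    def __init__(self, filter_size):
        self.filter_size = filter_size

    def on_signal(self, conv_type):
        # Compute only when the parameter maps to this circuit's size; otherwise idle.
        if FILTER_SIZES.get(conv_type) == self.filter_size:
            return "compute"
        return "idle"


class MultiSizeMac:
    """Models a MAC hardcoded for several filter sizes that toggles between them."""

    def __init__(self, supported_sizes):
        self.supported_sizes = set(supported_sizes)
        self.active_size = None  # no size selected until a signal arrives

    def on_signal(self, conv_type):
        size = FILTER_SIZES.get(conv_type)
        if size in self.supported_sizes:
            self.active_size = size  # toggle to the indicated filter size
            return "compute"
        return "idle"


mac_3x3 = SingleSizeMac(3)
mac_multi = MultiSizeMac([3, 5])
```

In this model, `mac_3x3.on_signal(0)` results in a computation while `mac_3x3.on_signal(1)` leaves the circuit idle, and `mac_multi.on_signal(1)` switches the active filter size to five.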

Iteration Control Signals

Additionally, or alternatively, based on an implementation of the programmable FSM or execution of the FSM start instructions, S240 may function to cause a computation circuit of a processing core to iteratively perform a computational loop, such as a convolutional loop.

In one or more embodiments, an execution of an initial computation of a computational loop by the computation circuit may be initiated by the execution of the FSM start instructions. Alternatively, the execution of the initial computation of the computational loop by the computation circuit may be initiated by a start control signal transmitted by the programmable FSM directly to the computation circuit. Once a first computation of a computational loop is performed and results thereof written to memory, the programmable FSM may identify a state of the first computation and generate a convolution control signal based on the state of the first computation and an iteration parameter value of an iteration register file of the programmable FSM.

In one or more embodiments, the programmable FSM may include or implement a counter circuit in association with the iteration register file. In a first implementation, the counter circuit may be initialized or set based on the iteration parameter value of the iteration register file. For example, if the iteration register file stores an iteration parameter value of seven (7), S240 may function to program or set the counter circuit to seven. In this first implementation, each computation or write signal by the computation circuit may cause the counter circuit to decrement by one (1). In such an embodiment, the programmable FSM may evaluate a state of the counter circuit against logic that, if satisfied (or not), may cause the programmable FSM to assert an iteration control signal until the seven iterations of the computational loop are completed by the computation circuit. In a second implementation, the counter circuit may be set to an initial value, such as zero (0), and incremented based on each computation completed in a computation loop or each write to memory by the computation circuit. In this second implementation, S240 may implement the programmable FSM to evaluate a state of the counter circuit against the iteration parameter value and continue generation of iteration control signals for causing the computation circuit to perform an iteration of a computational loop until a value of the counter circuit matches the iteration parameter value.
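The two counter implementations described above may be sketched, purely for illustration, as the following pair of functions. The function names and the use of a simple integer count of asserted signals are assumptions; the sketch also assumes that each asserted iteration control signal results in exactly one completed computation.

```python
def run_down_counter(iteration_value):
    """First implementation: load the counter from the iteration register file
    and decrement it once per completed computation (or write)."""
    counter = iteration_value
    signals_asserted = 0
    while counter > 0:
        signals_asserted += 1  # FSM asserts an iteration control signal
        counter -= 1           # computation circuit completes and writes
    return signals_asserted


def run_up_counter(iteration_value):
    """Second implementation: start the counter at zero and increment it until
    it matches the iteration parameter value."""
    counter = 0
    signals_asserted = 0
    while counter != iteration_value:
        signals_asserted += 1  # FSM asserts an iteration control signal
        counter += 1           # computation circuit completes and writes
    return signals_asserted


down_result = run_down_counter(7)
up_result = run_up_counter(7)
```

Under these assumptions both implementations assert the same number of iteration control signals for a given iteration parameter value (seven in the example above); they differ only in whether the counter is compared against zero or against the stored parameter.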

It shall be recognized that, in a preferred embodiment, the programmable FSM may include or may function to implement at least two distinct iteration register files and associated distinct counter circuits for each of the at least two distinct iteration register files. In this preferred embodiment, a first iteration register file may be configured to store an iteration parameter value that delineates the number of iterations or cycles/loops to be performed by a computation circuit, such as a MAC, using or reading the input data actively stored in the local memory of the MAC. A second iteration register file may be configured to store an iteration parameter value that delineates a data movement, which may include a first bit identifying one or more destination sides of a processing core to which local data may be sent to neighboring processing cores, and a second bit identifying a data movement, such as a degree or angle of a rotational movement of input data from neighboring processing cores to the output data to other neighboring processing cores. In such embodiments, once the data has been positioned or moved to the designated destination side, the continued execution of the iteration parameter value of the second register file may include executing an iteration of a computational loop on the input data.

Additionally, or alternatively, in one or more embodiments, executing computations on an array of processing cores may include an execution of one or more input bit movement instructions including, but not limited to, an input bit rotation, an input bit core hop, an input swap, and/or the like.

The systems and methods of the preferred embodiment and variations thereof can be embodied and/or implemented at least in part as a machine configured to receive a computer-readable medium storing computer-readable instructions. The instructions are preferably executed by computer-executable components preferably integrated with the system and one or more portions of the processor and/or the controller. The computer-readable medium can be stored on any suitable computer-readable media such as RAMs, ROMs, flash memory, EEPROMs, optical devices (CD or DVD), hard drives, floppy drives, or any suitable device. The computer-executable component is preferably a general or application specific processor, but any suitable dedicated hardware or hardware/firmware combination device can alternatively or additionally execute the instructions.

Although omitted for conciseness, the preferred embodiments include every combination and permutation of the implementations of the systems and methods described herein.

As a person skilled in the art will recognize from the previous detailed description and from the figures and claims, modifications and changes can be made to the preferred embodiments of the invention without departing from the scope of this invention defined in the following claims.

What is claimed:
 1. A method for accelerating an execution of computational loops on an integrated circuit, the method comprising: programming a finite state machine (FSM) based on a loop iteration parameter comprising a number of computation cycles of a computational loop to be executed by a computational circuit; at runtime, executing the FSM based on a start signal, wherein executing the FSM includes: (i) generating, by the FSM, a plurality of control signals including a distinct control signal for each of the number of computation cycles of the computational loop; and (ii) controlling, by the FSM, an operation of the computational circuit executing the computational loop based on a transmission of the plurality of control signals to the computational circuit.
 2. The method according to claim 1, wherein the FSM is controllably connected to a plurality of processing cores, each of the plurality of processing cores having at least one computational circuit.
 3. The method according to claim 1, wherein at runtime, the FSM is executed without performing fetches of computational loop instructions.
 4. The method according to claim 1, further comprising: programming the FSM based on a data movement parameter comprising at least one data movement instruction that, when executed, moves input data from a register file of a first processing core to data input ports of neighboring processing cores.
 5. The method according to claim 4, wherein at runtime, the FSM is executed without performing fetches of data movement instructions.
 6. The method according to claim 4, wherein the register file is associated with one or more data output ports of the first processing core and data input ports of the first processing core, wherein the data input ports of the first processing core are directly connected to data output ports of the neighboring processing cores; and executing the at least one data movement instruction causes the input data to rotate an angle from the data input ports of the first processing core to the one or more data output ports of the first processing core.
 7. The method according to claim 4, wherein at runtime, executing the FSM causes an execution of the computational loop based on the loop iteration parameter, and subsequently, executing the FSM causes an execution of one or more computational loops based on the loop iteration parameter and the data movement parameter.
 8. The method according to claim 4, wherein at runtime, the FSM generates: a first set of control signals of the plurality of control signals for executing the computational loop based on the loop iteration parameter; and in response to completing the computational loop based on the loop iteration parameter, a second set of control signals of the plurality of control signals for executing (a) the number of computation cycles of the computational loop and (b) the at least one data movement instruction based on the loop iteration parameter and the data movement parameter.
 9. The method according to claim 1, wherein programming the FSM includes identifying, by the FSM, a distinct data movement control signal for each of the number of computation cycles of the computational loop based on the loop iteration parameter and a data movement parameter.
 10. The method according to claim 9, wherein controlling the operation of the computational circuit executing the computational loop includes transmitting, by the FSM, the distinct data movement control signal for each of the number of computation cycles of the computational loop until the number of computation cycles of the computational loop are completed.
 11. The method according to claim 1, wherein programming the FSM includes encoding a starting memory address parameter to a start memory address register file accessible to one or more computational circuits controllable by the FSM.
 12. The method according to claim 11, wherein the starting memory address parameter comprises a register file pointer that points to a head of input data at a location within an n-dimensional memory stored within at least one processing core controllable by the FSM.
 13. The method according to claim 1, wherein programming the FSM includes encoding a convolution filter size parameter to a convolution register file of at least one processing core controllable by the FSM.
 14. The method according to claim 13, wherein the convolution filter size parameter comprises a value that maps to one of a plurality of distinct convolutional filter sizes for a given convolutional computation by a multiply accumulator circuit of the at least one processing core.
 15. The method according to claim 1, wherein programming the FSM includes encoding the loop iteration parameter to a combination of distinct iteration register files of at least one processing core controllable by the FSM.
 16. The method according to claim 1, wherein: at runtime, the FSM generates the plurality of control signals causing an execution of an N-way multiply accumulate with computation weights and computation input data, wherein: N relates to a number of distinct multiply accumulate circuits concurrently executing a distinct computational loop, and N is greater than one.
 17. The method according to claim 1, wherein if a convolution filter size parameter of the FSM includes a value that maps to one of a plurality of distinct convolutional filter sizes that is greater than a 1×1 convolutional filter size, the FSM broadcasts input data pointed to by a starting memory address parameter to a collection of processing cores in neighboring proximity to the FSM.
 18. A method comprising: programming a finite state machine (FSM) based on one or more FSM initialization parameters, wherein the one or more FSM initialization parameters include a loop iteration parameter comprising a number of multiply-accumulate computation cycles of a convolutional loop; at runtime, implementing the FSM to enable one or more computations by: (i) generating, by the FSM, a plurality of convolutional loop control signals based on the loop iteration parameter; and (ii) controlling, by the FSM, an execution of a plurality of multiply-accumulate computation cycles of a multiply accumulator circuit (MAC) performing the convolutional loop based on transmitting the plurality of convolutional loop control signals until the number of multiply-accumulate computation cycles of the convolutional loop are completed.
 19. The method according to claim 18, wherein programming the FSM includes: (i) programming a starting memory address parameter at a start memory address register file accessible to the MAC; (ii) programming a convolution filter size parameter at a convolution register file accessible to the MAC; and (iii) programming one or more iteration parameters at one or more iteration register files accessible to the FSM.
 20. A method for implementing finite state machine (FSM)-controlled convolutional computations on an integrated circuit, the method comprising: configuring an FSM based on one or more FSM programming instructions, wherein the FSM controls: (a) computations of multiply accumulator circuits (MACs) of a plurality of distinct processing cores, and (b) data movement operations of data ports of the plurality of distinct processing cores; wherein configuring the FSM includes: (1) encoding a starting memory address value to an address register file accessible to the MACs of the plurality of distinct processing cores, (2) encoding a convolutional filter size to a convolutional register file associated with the FSM, and (3) encoding an iteration value to at least one iteration register file associated with the FSM, wherein the iteration value identifies a number of cycles of a convolutional loop performed by at least one of the MACs; and executing a Boolean switch based on the configuring of the FSM that starts an operation of the FSM for generating control signals to the MACs for automatically executing one or more distinct convolutional loops.